Abstract

The University of Maryland and Johns Hopkins University worked
together in the 2004 High Accuracy Retrieval from Documents (HARD)
track to explore design options for interactive passage retrieval
systems. HARD assessors responded to clarification forms by (1)
selecting additional search terms from an automatically constructed
list of potentially discriminating terms, (2) selecting relevant
passages from an automatically constructed list of possibly relevant
passages, and (3) entering additional search terms. Query expansion
based on these three types of elicited information yielded
statistically significant improvements in R-precision over baselines
with and without blind relevance feedback. For topics that requested
passages as answers, a preliminary analysis shows that statistical
models for passage extent trained on HARD 2003 data yielded a
significant improvement over a replication of the University of
Maryland’s HARD-2003 technique for passage extent determination, and
the results of the new technique appear to be generally well above the
median for HARD 2004 systems.

1 Introduction

An Information Retrieval (IR) process can be modeled as establishing
relationships between queries entered by a user and the documents in a
collection. In such a model, evidence, usually counts of
content-bearing words drawn from entire documents, is used as a basis
for assessing the strength of a relationship. Some research has also
focused on modeling portions of a document text (called “passage
retrieval”), finding that passage-level evidence can sometimes provide
better evidence for the full-document retrieval task than that of the
full document text, especially when the documents are long or span
different subject areas (Callan, 1994; Kaszkiel and Zobel, 1997).

We started to work on passage retrieval at the University of Maryland
in the 2003 High Accuracy Retrieval from Documents (HARD) track. Our
interest is motivated by the novel design of the HARD passage
retrieval evaluation: in HARD, passages are assessed based on their
intrinsic utility to searchers as passages (rather than their
extrinsic value as a basis for retrieval of full documents). For our
2003 experiments we developed a simple but effective module to
identify and rank passages; it achieved an R-precision among the best
reported that year. However, a subsequent inter-annotator consistency
study conducted at the University of Maryland showed that our
HARD-2003 passage retrieval module was far below human performance on
the same task. Our analysis indicated that the most problematic part
of our approach that year was passage extent determination; our
passages were generally far shorter than the passages annotated by the
HARD assessors. Therefore, the first research question we wanted to
address was how we might better approximate human determination of
passage extent. The Johns Hopkins University joined our team this
year, and they focused on this challenge. We developed a set of
paragraph-based features and some statistical models to identify the
most likely passage extent for a query in its corresponding retrieved
documents.

Our second research question focused on optimizing the utility of a
limited opportunity for user interaction. Previous research on
presentation of passages in interactive information retrieval has
focused on the display of passages as a basis for the document
selection task (Knaus et al., 1995; He et al., 2004). Our goal, by
contrast, was to use passage-level feedback to improve passage
retrieval effectiveness. We explored this by using passage selection
and term selection in the design of our clarification forms, and then
using the results as a basis for automatic query expansion.

In this report, we first introduce our 2003 passage retrieval module
in section 2.1, then describe the design of the new passage extent
model in section 2.2. We then move to a discussion of the design and
use of clarification questions for improving passage retrieval in
section 3. We describe our experiments in section 4 and conclude with
a preliminary analysis of the experiment results in section 5.

2  Passage  Retrieval 

Perhaps the greatest challenge in the design of an end-user passage
retrieval system is that there is little a priori basis for
determining passage extent; some users may prefer terse passages,
while others may prefer more context. Learning from examples can be
useful in such cases, but only when users exhibit some degree of
agreement regarding the desired passage length. In 2003, however, no
training examples were available. We therefore adopted a simple ad-hoc
approach for passage extent determination in HARD 2003. Since then, we
have run a small inter-annotator agreement study, concluding that
annotation consistency would be adequate to detect further
improvements over our HARD 2003 system (Demner-Fushman et al., 2004).
We therefore developed a new system for passage extent determination
that is trained on the LDC HARD 2003 passage extent judgments. Both
the old and the new system are described in this section.


2.1  The  2003  Passage  Retrieval  Module 

Leveraging previous research on passage retrieval (Liu and Croft,
2002; Kaszkiel and Zobel, 1997), our 2003 passage retrieval module was
based on the assumption that the relevance of a passage to a given
query is related to:

•  the computed overall probability of relevance for the document
that contains the passage;

•  the density of the query terms appearing in the passage;

•  the importance of the query terms appearing in the passage.

For our 2003 passage retrieval module, we used (1) Inquery scores as a
surrogate for the probability of relevance of the documents; (2) the
number of different query terms appearing in the passage, and how
close their positions in the passage are, as the representation of the
query term density; and (3) TFIDF weights of the query terms in the
passage, normalized for passage length and adjusted by relative
importance factors assigned to each query term based on its source
(e.g., title field, clarification form, or blind relevance feedback),
as the representation of the importance of each term. We formed a
linear combination of these three factors, giving more emphasis to the
document scores because Inquery scores have been demonstrated to be a
useful approximation to document relevance, whereas it was the first
time we had tried the other two factors.

This approach implicitly assumes that passages have a known extent,
but it offers no guidance on what that extent should be. We chose to
model variable passage extents rather than using a fixed window size
because that choice allows us to take advantage of paragraphs, a
meaningful structural unit that is assigned by the author of the
documents.

Our passage retrieval model identifies each instance of a query term
and then extends the passage to the nearest paragraph boundary in each
direction. When there is no paragraph markup in the document, we use a
fixed 40-word window around the query term as the passage extent. When
two passages containing a query term are adjacent, we merge them into
a single larger passage.
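
The extent rule above can be illustrated with a minimal Python sketch.
It assumes a document is already split into paragraphs (each a list of
tokens) and that query terms are given as a lowercase set. The
function names, the symmetric placement of the 40-word window around
the term, and the merging of overlapping windows are illustrative
assumptions, not details of the 2003 module.

```python
def paragraph_passages(paragraphs, query_terms):
    """Return inclusive (first, last) paragraph index ranges that contain
    a query term, merging ranges that are adjacent."""
    hits = [i for i, para in enumerate(paragraphs)
            if any(tok.lower() in query_terms for tok in para)]
    passages = []
    for i in hits:
        if passages and i == passages[-1][1] + 1:
            passages[-1] = (passages[-1][0], i)   # adjacent paragraph: merge
        else:
            passages.append((i, i))
    return passages


def window_passages(tokens, query_terms, window=40):
    """Fallback when there is no paragraph markup: (start, end) token spans,
    end exclusive, of `window` words around each query term occurrence."""
    half = window // 2
    spans = []
    for pos, tok in enumerate(tokens):
        if tok.lower() in query_terms:
            start = max(0, pos - half)
            end = min(len(tokens), pos + half)
            if spans and start <= spans[-1][1]:
                # assumed behavior: overlapping or touching windows are merged
                spans[-1] = (spans[-1][0], max(spans[-1][1], end))
            else:
                spans.append((start, end))
    return spans
```

For example, with query term "cat" occurring in paragraphs 0, 2 and 3
of a five-paragraph document, the sketch returns one single-paragraph
passage and one two-paragraph passage.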

Passages were ranked in decreasing order of these scores, and the top
1000 passages were returned for each query. In 2003, we limited the
number of passages from a single document to the best three; based on
our 2003 results, we have removed that restriction for 2004.

2.2 A Statistical Model for Passage Extent Determination

For 2004, we focused on improving passage retrieval by exploring (i) a
novel Hidden Markov Model (HMM) approach, (ii) the application of
Linear Discriminant Analysis (LDA), and (iii) a voting scheme among
multiple classifiers.

Our analysis of HARD 2003 data showed that the atomic units for
retrieved passages are paragraphs; the assessors who worked on HARD
2003 data were never observed to choose sub-paragraph units as a
passage. Therefore, we model each paragraph in a document as being in
one of two states: “relevant” and “irrelevant”. A “relevant” paragraph
appears in the system output as part of a “relevant” passage.

It is natural to model the consecutive paragraphs in a document as
changing their states according to the transitions of a Hidden Markov
chain (see Figure 1). In this model, the probability that a paragraph
is relevant or irrelevant depends on the characteristics of the
current paragraph and the state of the immediately preceding
paragraph. Table 1 shows the transition probabilities of the Markov
chain, trained on a mixture of: (i) the relevant documents from the
HARD 2003 LDC passage retrieval judgments, and (ii) the top 100
passages that were found automatically by our 2003 passage retrieval
system; these probabilities were used for the HARD 2004 evaluation.

Figure 1: The Hidden Markov chain, with two states and four possible
transitions. The output of the HMM is considered to be “emitted”
either from a state or a transition.


                   to relevant   to irrelevant
from relevant         0.87           0.13
from irrelevant       0.04           0.96

Table 1: The transition probabilities between “relevant” and
“irrelevant” paragraphs.
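
As a small worked example of what the probabilities in Table 1 imply,
the following Python sketch computes two standard properties of a
two-state Markov chain: the long-run fraction of paragraphs in the
“relevant” state and the expected length of a run in a given state.
The function names are hypothetical, and the computation is only an
illustration of the table, not part of the system described here.

```python
# Transition probabilities copied from Table 1.
P = {
    "relevant":   {"relevant": 0.87, "irrelevant": 0.13},
    "irrelevant": {"relevant": 0.04, "irrelevant": 0.96},
}


def stationary_relevant(P):
    """Long-run fraction of paragraphs in the 'relevant' state."""
    enter = P["irrelevant"]["relevant"]   # irrelevant -> relevant
    leave = P["relevant"]["irrelevant"]   # relevant -> irrelevant
    return enter / (enter + leave)


def expected_run_length(P, state):
    """Expected number of consecutive paragraphs spent in `state`
    once the chain has entered it (mean of a geometric distribution)."""
    return 1.0 / (1.0 - P[state][state])
```

Under these probabilities a run of relevant paragraphs lasts about 7.7
paragraphs on average (1 / 0.13), and a run of irrelevant paragraphs
about 25 (1 / 0.04).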


We then assumed that the output of the HMM is a Gaussian-distributed
scalar. Depending on the particular model, this scalar may be
“emitted” at each state, or at each state transition. In both cases,
this scalar-valued feature is equal to a linear combination of various
similarity measures between the query and the paragraph, between
adjacent paragraphs, etc. The weights of the linear combination can be
determined, during the training phase, using Linear Discriminant
Analysis (LDA), based on the ground truth.

We employed the following set of similarity measures for paragraph i
as the scalar-valued features in our LDA model. All the paragraphs
were preprocessed with stemming and stopword removal, and the temporal
sequence of paragraphs in each document was preserved during the
calculation.

1.  Paragraph features: Similarity of the i-th paragraph with the
query (a) Title, (b) Description, (c) Narrative, and (d) the
“negative” portion of the query Narrative. (4 dimensions)¹

2.  Document features: Similarity of the entire document with the (a)
Title, (b) Description, (c) Narrative, and (d) the “negative” portion
of the Narrative. These provide a baseline for interpreting the
similarity scores 1(a)-1(d) of individual paragraphs within the
document. (4 dimensions)

3.  Document-minus-paragraph features: Similarity of the document,
less the i-th paragraph, with the (a) Title, (b) Description, (c)
Narrative, and (d) the “negative” portion of the Narrative. (4
dimensions)

4.  Inter-paragraph similarity: Similarity between the i-th and the
(i-1)-th paragraph. (1 dimension)

5.  Delta-features: The “temporal” derivatives of the paragraph-based
features 1(a)-1(d), 3(a)-3(d) and 4. (9 dimensions)

¹ By “negative,” we mean the text segment in a narrative that
describes what the retrieved passages “should not” contain. Such
segments are found automatically by detecting the presence of cue
phrases such as “should not contain.”

All similarities are computed using the Okapi formula (Robertson et
al., 1994), where the inverse “document” frequencies are computed at
the paragraph level (i.e., they are inverse paragraph frequencies).

The elements of the above 22-dimensional vector are linearly combined
through 3 sets of LDA coefficients: one set was trained assuming that
the vector was emitted from the HMM state; the other 2 sets were
trained assuming that the previous state was relevant or irrelevant,
respectively (that is, the vector was emitted by the transition of the
HMM).

During both training and testing, for each paragraph of each document,
we computed scalar quantities equal to the linear combinations of the
various similarity values and differences, with weights obtained
through LDA training. Thus, we obtained 3 scalars: one assumed to be
the output of an HMM with 2 conditional output distributions (one per
state), and two assumed to be the output of an HMM with 4 conditional
output distributions (one per transition), where the two scalars
correspond to the two possible originating states.

For each one of the HMMs (one with outputs on states, and one with
outputs on transitions), we compute the likelihood of the observed
output using the forward-backward equations (Jelinek, 1997); then, we
pick the state sequence which minimizes the state error (maximum a
posteriori estimation).
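
The decoding step for the state-emission HMM can be sketched in Python
as below. This is a minimal illustration under stated assumptions: the
Gaussian means and standard deviations, the initial state
probabilities, and the function names are hypothetical placeholders,
and only the variant with outputs on states is shown. The per-step
rescaling is a common device to avoid numerical underflow and is not
taken from the paper.

```python
import math


def gaussian(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def posterior_decode(obs, trans, emit, init):
    """Forward-backward for an HMM with Gaussian outputs on states.

    obs: list of scalars (one per paragraph); trans[s][t]: P(t | s);
    emit[s]: (mean, std); init[s]: probability of starting in s.
    Returns the state with the highest posterior for each paragraph,
    which minimizes the expected number of per-paragraph state errors.
    """
    if not obs:
        return []
    states = list(trans)
    n = len(obs)
    like = [{s: gaussian(x, *emit[s]) for s in states} for x in obs]

    # Forward pass, rescaled at every step.
    alpha = []
    for t in range(n):
        if t == 0:
            a = {s: init[s] * like[0][s] for s in states}
        else:
            a = {s: like[t][s] * sum(alpha[t - 1][r] * trans[r][s] for r in states)
                 for s in states}
        z = sum(a.values())
        alpha.append({s: a[s] / z for s in states})

    # Backward pass, rescaled the same way.
    beta = [None] * n
    beta[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        b = {s: sum(trans[s][r] * like[t + 1][r] * beta[t + 1][r] for r in states)
             for s in states}
        z = sum(b.values())
        beta[t] = {s: b[s] / z for s in states}

    # The posterior of state s at step t is proportional to alpha * beta.
    return [max(states, key=lambda s: alpha[t][s] * beta[t][s]) for t in range(n)]
```

With the Table 1 transition probabilities, a single weakly “relevant”
observation surrounded by clearly irrelevant ones is still decoded as
irrelevant, because entering and leaving the relevant state within one
paragraph is improbable.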

Moreover, in addition to the HMM detectors, we used a collection of
very conservative classifiers, which exploit some trends that were
observed in the HARD 2003 data (we assumed that these trends would
also hold for HARD 2004). Specifically, we built the following 8
classifiers for finding relevant paragraphs in every retrieved
document:

1.  The paragraph with the highest similarity to the title field of
the query is marked as relevant.

2.  The paragraphs with the two highest differences of similarities
(to the query’s title) from the similarities of the preceding
paragraphs are marked as relevant.

3.  For each document, we expressed the similarities of paragraphs to
their preceding paragraphs as a time series, and we computed its
Fourier transform. Then, we set all paragraphs of a document to be
relevant if the bandwidth of the computed spectrum is among the lowest
10% of bandwidths of all returned documents for a given topic. (By
bandwidth we mean the range of frequencies which contains most of the
signal energy.) The rationale behind this technique is that documents
which are fairly homogeneous (i.e., all paragraphs are on-topic) have
slow variation in inter-paragraph similarities (hence, small spectral
bandwidth).

4-7.  Similar to 1-2 above, but with similarity to the query’s
description and narrative.

8.  The paragraph with the highest weighted sum of the values of its
22-dimensional vector (described above) is marked as relevant. The
weights were chosen empirically, based on the HARD 2003 data.
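
Classifier 3 in the list above can be sketched as follows. The paper
does not give the exact bandwidth definition, so this Python sketch
makes explicit assumptions: bandwidth is taken to be the number of
lowest-frequency DFT bins needed to hold 90% of the signal energy, the
DC component is included, and ties among documents are broken by sort
order. All names and the 0.9 threshold are hypothetical.

```python
import cmath


def spectral_bandwidth(series, energy_fraction=0.9):
    """Number of lowest-frequency DFT bins that together hold
    `energy_fraction` of the energy of a real-valued series."""
    n = len(series)
    energy = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(series))
        # Bins other than DC and Nyquist have a mirrored twin, so count them twice.
        weight = 1 if k == 0 or (n % 2 == 0 and k == n // 2) else 2
        energy.append(weight * abs(coeff) ** 2)
    total = sum(energy)
    if total == 0:
        return 0
    running = 0.0
    for k, e in enumerate(energy):
        running += e
        if running >= energy_fraction * total:
            return k + 1
    return len(energy)


def homogeneous_documents(bandwidth_by_doc, fraction=0.1):
    """Ids of documents whose bandwidth is among the lowest `fraction`
    of all documents returned for a topic."""
    ranked = sorted(bandwidth_by_doc, key=bandwidth_by_doc.get)
    keep = max(1, int(len(ranked) * fraction))
    return set(ranked[:keep])
```

A document whose inter-paragraph similarities are constant gets the
smallest bandwidth under this definition, while one whose similarities
alternate between high and low values needs bins up to the highest
frequency.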

Finally, each paragraph of each document is scored according to two
schemes:

•  Score 1: The number of classifiers which classify the paragraph as
relevant (integer-valued). If no classifier marks it as relevant, and
the adjacent paragraphs have non-zero scores, then Score 1 is equal to
Score 2 (otherwise, if the adjacent paragraphs were not classified as
relevant by any classifier, Score 1 is negative, and proportional to
the “bandwidth” of the document).

•  Score 2: The average of two normalized likelihoods of the
paragraph, with respect to the two HMMs.
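
The Score 1 rule can be written out as a short Python sketch. Two
points are assumptions made for illustration: a paragraph is treated
as having scored neighbors when at least one adjacent paragraph
received a vote, and the proportionality constant for the negative
score (`scale`) is a hypothetical parameter.

```python
def score1(votes, score2, bandwidth, scale=1.0):
    """Combine classifier votes with the HMM-based Score 2.

    votes[i]: number of classifiers marking paragraph i as relevant;
    score2[i]: Score 2 of paragraph i; bandwidth: the document's
    spectral bandwidth. Returns one Score 1 value per paragraph.
    """
    scores = []
    for i, v in enumerate(votes):
        if v > 0:
            scores.append(float(v))          # vote count
            continue
        neighbors = [votes[j] for j in (i - 1, i + 1) if 0 <= j < len(votes)]
        if any(n > 0 for n in neighbors):
            scores.append(score2[i])         # fall back to Score 2
        else:
            scores.append(-scale * bandwidth)  # negative, proportional to bandwidth
    return scores
```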

Since our passage retrieval model operates on the output of the
document retrieval results, we trained on a mixture of documents:
those which were truly relevant (obtained from the golden truth), and
the documents which contained the top-100 passages that our document
retrieval system had produced for the 2003 evaluation. We did a
10-fold cross-validation. Table 2 shows the R-precision obtained on
HARD 2003 data, for different scoring schemes and test sets: (i) truly
relevant documents, obtained from the golden truth; (ii) the subset of
UMD’s (2003) retrieved documents that contained the top-100 passages
for that topic; and (iii) the top-1000 documents per topic. The
measure reported throughout this report is R-Precision because this
was the measure we used last year and during our training.
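
For readers unfamiliar with the measure, R-precision in its basic
ranked-list form is the precision at rank R, where R is the number of
relevant items for the topic. The Python sketch below shows only this
basic form over item identifiers; the passage-level variant used in
the HARD track also has to account for passage extent and is not
reproduced here. The function name is hypothetical.

```python
def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant items."""
    relevant = set(relevant_ids)
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for item in ranked_ids[:r] if item in relevant) / r
```

For instance, with three relevant items of which two appear in the top
three ranks, the R-precision is 2/3.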


Score 1
Testing on:            R-Precision
Truly relevant docs    0.51
Top-100 passages       0.37*
Top-1000 docs          0.23

Score 2
Testing on:            R-Precision
Truly relevant docs    0.49
Top-100 passages       0.29
Top-1000 docs          0.12

Table 2: The retrieval effectiveness (R-Precision) obtained by JHU
passage retrieval models on 2003 HARD data.

The 37% R-Precision (0.37, marked with * in Table 2) is significantly
higher than the 32% R-Precision that the UMD passage retrieval system
achieved during the HARD 2003 evaluation.

Furthermore, one can see that Score 1 consistently gives better
R-precision than Score 2. For that reason, it was chosen as the first
submission in the HARD 2004 evaluation.

During the development of the models, we also explored integrating
Marti Hearst’s TextTiling system (Hearst, 1997) into our passage
retrieval model, where “tiles” were treated as atomic units rather
than natural paragraphs. Table 3 shows the R-Precision under Score 1
on the 2003 passage retrieval data for two values of the TextTiling
parameter w, obtained by 10-fold cross-validation on the 1042 truly
relevant documents and on the top-1000 documents. In both cases, the
R-Precision is significantly lower than the one obtained when the
atomic blocks are paragraphs.

 w    R-Precision
      Truly relevant docs    Top-1000 docs
 7    0.45                   0.19
20    0.37                   0.11

Table 3: The passage retrieval results of using “tiles” as atomic
units.

3 Clarification Questions

Communicating through clarification forms provides each site a means
to interact with the people who proposed the search topic. We modeled
the communication as a simplified need negotiation process in last
year’s HARD experiment (He and Demner-Fushman, 2003). That work
demonstrated that such interaction can be used to elicit several types
of information, including relevance feedback, extra information about
the user’s need, the user’s characteristics, and the user’s
preferences (He and Demner-Fushman, 2003). This year we concentrated
mainly on two of them, relevance feedback and extra information about
the need, since they were found to be the most useful last year.

As stated, the effectiveness of our passage retrieval module depends
on the quality of the documents returned, the query terms used for
finding the passage locations, and the extent of the passage. The
passage extent problem was mainly addressed in section 2.2; however,
we also took the opportunity offered by the clarification forms to ask
about the user’s preference for passage length. One of the
clarification questions was:

You  expect  your  information  need 
to  be  fulfilled  in/by: 

1.  One  or  two  sentences  in  a 
paragraph 

2.  One  or  two  paragraphs 

3.  Several  paragraphs  in  a 
document 

4.  Several  paragraphs  in 
several  documents 

Eliciting named entities was demonstrated to be an effective approach
for improving the search results in last year’s experiment (He and
Demner-Fushman, 2003). We therefore employed similar questions for
named entities in this year’s clarification forms. We specifically
worked on three types of named entities (personal names, organization
names, and locations), all of which would give us phrases or other
unique content words.

Our named-entity questions included relevance feedback questions, in
which the terms were first identified by BBN IdentiFinder (bbn, ),
then selected based on the phrases’ TFIDF scores in the top 10
returned documents. To satisfy the space restrictions, we selected
only the top 5 ranked phrases each for personal names, organization
names, and locations. The questions also included elicitation of extra
terms of the same types.

The majority of clarification questions were dedicated to relevance
feedback on returned passages if the users wanted passages as the
preferred result format, or on returned documents otherwise. We knew
the users’ preference from the topic metadata.

Whichever the preference, we based the generation of clarification
questions on the output of our 2003 passage retrieval module. We made
this choice because we wanted to show the passages themselves when
passages were to be judged: some information would be lost no matter
how good a summary is, and our passages were generally short enough to
fit into the clarification forms. We determined that there was screen
space for up to five passages. When documents were to be judged, we
were forced to use surrogates, which are the concatenation of all the
passages from those documents that are ranked within the top 1000.

We had a choice of selecting the top five ranked passages, or
selecting the top five passages from different sub-topic areas. We
designed our clarification forms around the latter to let the user
view as many sub-topic areas as possible, and to maximize the
possibility that some displayed passages are relevant even when the
returned results were of poor quality. We designed a Maximum Marginal
Relevance (MMR)-like selection scheme to achieve this purpose.

Zhai describes MMR as a scheme that is capable of considering both the
relevance and the novelty of returned documents (Zhai, 2002). Our
MMR-like selection scheme reflects this thinking. To maintain the
relevance of the selected passages, we only chose passages that were
ranked in the top 200. Our selection of the number 200 was essentially
ad hoc, but it is large enough to include an adequate number of
different passages, while these passages are still ranked highly
enough to maintain some relevance.

The novelty in our scheme was defined as an adequate difference
between a passage and all previously selected passages. The difference
was calculated based on the content terms in the passages, with term
weights defined by TFIDF. By starting the selection from the top
ranked passages, our scheme identifies the top five different
passages.
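
A minimal Python sketch of such a greedy selection is given below. The
paper does not specify how the “adequate difference” was measured, so
this sketch assumes cosine similarity between TFIDF-weighted term
vectors with a hypothetical threshold of 0.5; the function names are
also illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_diverse(ranked_passages, pool_size=200, k=5, max_similarity=0.5):
    """Walk the ranking top-down and keep a passage only if it differs
    enough from every passage already kept.

    ranked_passages: list of (passage_id, {term: tfidf_weight}) in rank order.
    """
    selected = []
    for pid, vec in ranked_passages[:pool_size]:
        if all(cosine(vec, kept) <= max_similarity for _, kept in selected):
            selected.append((pid, vec))
            if len(selected) == k:
                break
    return [pid for pid, _ in selected]
```

Because the walk starts at the top of the ranking, the first passage
is always kept, and lower-ranked passages displace nothing; they are
only added when they bring sufficiently different content terms.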

We elicited two types of judgments from users. The first type
concerned the relevance of the passages/documents, which could be “not
relevant”, “on topic” (i.e., soft relevant), or “relevant” (i.e., hard
relevant). When passages were displayed, the users were also asked to
judge the length of each passage: “too short”, the “right length”, or
“too long”. We used this second type of information to fine-tune the
passage extent model for individual topics.

4  Experiments 

4.1  Resources 

The document retrieval system we used was the InQuery text retrieval
system (version 3.1p1) from the University of Massachusetts. The
collection was the full HARD 04 collection, which contains 652,710
documents from eight different news sources. All the documents were
stemmed using InQuery’s own stemmer before indexing.

Before generating search queries, we preprocessed the topic statement.
We marked up the named entities in the topic statement using BBN’s
IdentiFinder, and treated them as phrases in queries. We also listed
the terms in the phrases as individual words in the queries, to cover
the case where only part of a phrase appears in the documents.

4.2  Experiment  Runs  and  Clarification 
Forms 

We ran several baseline runs using the title only (run TITONL), the
title and description only (run TITDES), the title plus phrases and
top weighted (using TFIDF) terms from the description and narrative
(run TFIDF), and blind relevance feedback on top of each of the
previous three runs (marked as runs TOLBRF, TDABRF, and TIDBRF
respectively in this report).

We generated two sets of clarification forms. CF1 was based on the
results from run TFIDF, using the passage retrieval module to generate
passages. CF2 was based on the BRF version of run TFIDF (i.e., run
TIDBRF), and it used the same passage retrieval module. We obtained
users’ answers for both sets.

Automatic query expansion was performed separately on the answers from
CF1 and from CF2. Highly representative content terms were extracted,
based on a TFIDF scheme, from all the selected passages/documents for
each topic. These terms, combined with the named entities (NEs) that
users provided through the clarification forms and with the original
queries, became the expanded queries. The combination was weighted
linearly, with more weight given to the original queries and the
elicited NEs. Two document retrieval runs were generated from the
expanded queries obtained through this query expansion scheme, one for
CF1 and one for CF2. They are marked as runs EXPCF1 and EXPCF2.

These document runs were then used as the input for both the JHU
passage retrieval models and our UMD passage retrieval model. We
therefore generated three passage retrieval runs, CF1JHU1, CF1JHU2,
and CF1UMD, which correspond to JHU model 1, JHU model 2, and the UMD
passage model respectively.

5 Experiment Results and Discussion

5.1 Passage Results


Figure  2:  R-Precision  difference  between  run 
JHUDOC1  and  the  medians  of  all  the  submitted 
runs  on  passage  preferred  topics. 

As shown in Figure 2, both our runs CF1JHU1 (official run id
UMAREXPR1) and CF1UMD (official run id UMAREXPR5) performed reasonably
well, with most topics’ results (measured by R-Precision) above the
medians of all submitted runs. The run CF1JHU1, which uses the JHU
passage retrieval model with Score 1, achieved an average R-Precision
of 0.1468. Our slightly improved UMD 2003 passage retrieval model
achieved an average R-Precision of 0.0872. The difference between
these two runs is 0.0586, and it is statistically significant,
although only just (P is just under 0.05 using a t-test).

To establish the potential of our passage retrieval model, we used the
document golden truth as the input for our JHU passage retrieval
model. As we found in training, the passage retrieval results improved
dramatically. When marking passage extent in only relevant documents
for the 2004 topics, the model based on Score 1 yields an R-Precision
of 0.57, and that based on Score 2 yields 0.55. This is an upper bound
on passage extent performance with perfect document retrieval on HARD
2004 data.

5.2  Interactive  Clarification 

In our result analysis, we established two baselines for comparison.
The run TFIDF mentioned above, which was used to generate CF1, does
not have any feedback, so it was treated as a low baseline, whereas
the blind relevance feedback run (run TIDBRF) was treated as a high
baseline. The experimental runs are run CF1DOC, which is the expanded
document run based on CF1, and run CF2DOC, which is the expanded
document run based on CF2.

Our expanded runs achieved improvements over both baselines. Run
CF1DOC obtained a 21.20% improvement over the low baseline run TFIDF,
and the improvement is statistically significant (t-test, P < 0.05).
The improvement of run CF2DOC over the high baseline TIDBRF is 23.95%
(0.31 vs 0.2501), and this improvement is significant (t-test, P <
0.05) too. The first improvement is similar to our last year’s results
(He and Demner-Fushman, 2003), but the second improvement is
encouraging: it means that the clarification interaction can be
combined with blind relevance feedback, and the improvement might be
even bigger than that from performing interactive relevance feedback
without blind relevance feedback first.
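
The percentages quoted above are relative improvements in R-Precision.
As a quick arithmetic check, the following Python sketch recomputes
them from the figures given in this report (0.2212 and 0.2681 for CF1
in Table 4; 0.2501 and 0.31 for CF2 as quoted in the text). The
function name is hypothetical.

```python
def relative_improvement(baseline, expanded):
    """Relative improvement of `expanded` over `baseline`, in percent."""
    return 100.0 * (expanded - baseline) / baseline


cf1_gain = relative_improvement(0.2212, 0.2681)   # CF1: no expansion vs all
cf2_gain = relative_improvement(0.2501, 0.31)     # CF2: figures quoted in the text
```

Rounded to two decimals, these give 21.20 and 23.95 respectively.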

We then further explored the effectiveness of eliciting terms and of
relevance feedback on passages/documents separately. As shown in Table
4, asking users to select relevant passages/documents yielded a larger
improvement than eliciting terms from users (0.2665 vs 0.2481 in CF1,
0.3188 vs 0.2671 in CF2). The improvements achieved by the former runs
over their corresponding runs without clarification are statistically
significant (t-test, P < 0.05), whereas those of the latter runs are
not.

       no-exp    ask terms    feedback    all
CF1    0.2212    0.2481       0.2665      0.2681
CF2    0.2501    0.2671       0.3188      0.3166

Table 4: The effect of different approaches in interactive
clarification, measured by R-Precision.


6  Conclusion 

In this report, we discussed our effort in exploring design options
for interactive passage retrieval systems. We had two research
questions to address: (1) how might we better approximate human
determination of passage extent, and (2) how could we optimize the
utility of a limited opportunity for user interaction? Our preliminary
analysis of the results demonstrates that our newly developed passage
retrieval model based on statistical modeling achieved a significant
improvement over our 2003 passage retrieval model, which was one of
the best passage retrieval models in 2003. Our analysis also indicates
that our design of interactions through clarification forms generated
significant improvements over the baseline runs without the
interaction, whether or not the baseline employed blind relevance
feedback. We also found that asking for relevance feedback on
documents/passages yields more improvement than eliciting terms.

Our future work includes further analyzing the experiment results,
integrating users’ feedback on the passage length of individual topics
into our passage retrieval model, and comparing the studies of
interactive clarification in HARD 2003 and HARD 2004.